Abstract

Random intersection graphs (RIGs) are an important random structure with applications in social networks, epidemic networks, blog readership, and wireless sensor networks. RIGs can be interpreted as a model for large randomly formed non-metric data sets. We analyze the component evolution in general RIGs, and give conditions on existence and uniqueness of the giant component. Our techniques generalize existing methods for analysis of component evolution: we analyze survival and extinction properties of a dependent, inhomogeneous Galton-Watson branching process on general RIGs. Our analysis relies on bounding the branching processes and inherits the fundamental concepts of the study of component evolution in Erdős-Rényi graphs. The major challenge comes from the underlying structure of RIGs, which involves both the set of nodes and the set of attributes, as well as the set of different probabilities among the nodes and attributes.

Keywords: Random graphs, branching processes, probabilistic methods, random generation of combinatorial structures, stochastic processes in relation with random discrete structures.

1 Introduction

Bipartite graphs, consisting of two sets of nodes with edges only connecting nodes in opposite sets, are a natural representation for many networks. A well-known example is a collaboration graph, where the two sets might be scientists and research papers, or actors and movies [25, 16]. Social networks can often be cast as bipartite graphs since they are built from sets of individuals connected to sets of attributes, such as membership of a club or organization, work colleagues, or fans of the same sports team. Simulations of epidemic spread in human populations are often performed on networks constructed from bipartite graphs of people and the locations they visit during a typical day [11]. Bipartite structure, of course, is hardly limited to social networks. The relation between nodes and keys in secure wireless communication, for example, forms a bipartite network [6]. In general, bipartite graphs are well suited to the problem of classifying objects, where each object has a set of properties [10]. However, modeling such classification networks remains a challenge. The well-studied Erdős-Rényi model, $G_{n,p}$, successfully used for average-case analysis of algorithm performance, does not satisfactorily represent many randomly formed social or collaboration networks. For example, $G_{n,p}$ does not capture the typical scale-free degree distribution of many real-world networks [3]. More realistic degree distributions can be achieved by the configuration model [18] or the expected degree model [7], but even those fail to capture common properties of social networks such as the high number of triangles (or cliques) and strong degree-degree correlation [17, 1].

The most straightforward way of remedying these problems is to characterize each of the bipartite sets separately. One step in this direction is an extension of the configuration model that specifies degrees in both sets [14]. Another related approach is that of random intersection graphs (RIGs), first introduced in [24, 15]. Any undirected graph can be represented as an intersection graph [5]. The simplest version is the "uniform" RIG, $G(n,m,p)$, containing a set of $n$ nodes and a set of $m$ attributes, where any given node-attribute pair is joined by an edge with a fixed probability $p$, independently of other pairs. Two nodes in the graph are taken to be connected if and only if they are both connected to at least one common element in the attribute set. In our work, we study the more general RIG, $G(n,m,\mathbf{p})$ [20, 19], where the node-attribute edge probabilities are not given by a uniform value $p$ but rather by a set $\mathbf{p} = \{p_w\}_{w \in W}$: a node is attached to the attribute $w$ with probability $p_w$. This general model has only recently been developed and only a few results have been obtained, such as expander properties, cover time, and the existence and efficient construction of large independent sets [20, 19, 21].

In this paper, we analyze the evolution of components in general RIGs. Related results have previously been obtained for the uniform RIG [4], and two uniform cases of the RIG model, where a specific overlap threshold controls the connectivity of the nodes, were analyzed in [6]. Our main contribution is a generalization of the component evolution to a general RIG. We provide stochastic bounds by analyzing the stopping time of the branching process on a general RIG, where the history of the process is directly dictated by the structure of the general RIG. The major challenge comes from the underlying structure of RIGs, which involves both the set of nodes and the set of attributes, as well as the set of different probabilities $\mathbf{p} = \{p_w\}_{w \in W}$.



2 Model and previous work 

In this paper, we will consider the general intersection graph $G(n,m,\mathbf{p})$, introduced in [20, 19], with a set of probabilities $\mathbf{p} = \{p_w\}_{w \in W}$, where $p_w \in (0,1)$. We now formally define the model.

Model. There are two sets: the set of nodes $V = \{1, 2, \ldots, n\}$ and the set of attributes $W = \{1, 2, \ldots, m\}$. For a given set of probabilities $\mathbf{p} = \{p_w\}_{w \in W}$, independently over all $(v, w) \in V \times W$ let

$$A_{v,w} := \mathrm{Bernoulli}(p_w). \tag{1}$$

Every node $v \in V$ is assigned a random set of attributes $W(v) \subseteq W$,

$$W(v) := \{w \in W \mid A_{v,w} = 1\}. \tag{2}$$

The set of edges in $V$ is defined such that two different nodes $v_i, v_j \in V$ are connected if and only if

$$|W(v_i) \cap W(v_j)| \ge s, \tag{3}$$

for a given integer $s \ge 1$.

In our analysis, the $p_w$ are not necessarily the same, as they are in [4, 6],^1 and for simplicity we fix $s = 1$.
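The model definition translates directly into a sampler. The following is a minimal sketch, assuming nodes and attributes are indexed from 0 and that the probabilities are passed as a plain list `p`; the function names `sample_attribute_sets` and `intersection_edges` are illustrative and do not come from the paper.

```python
import random
from itertools import combinations


def sample_attribute_sets(n, p, rng):
    """Draw W(v) for every node v: attribute w is kept independently with probability p[w]."""
    return [{w for w, pw in enumerate(p) if rng.random() < pw} for _ in range(n)]


def intersection_edges(attribute_sets, s=1):
    """Return the edge set: nodes u < v are adjacent iff |W(u) & W(v)| >= s."""
    return {
        (u, v)
        for u, v in combinations(range(len(attribute_sets)), 2)
        if len(attribute_sets[u] & attribute_sets[v]) >= s
    }


if __name__ == "__main__":
    rng = random.Random(0)
    p = [0.05, 0.10, 0.02, 0.08]  # illustrative non-uniform p_w in (0, 1)
    W = sample_attribute_sets(10, p, rng)
    print(intersection_edges(W))
```

Separating the attribute draw from the edge rule keeps the threshold $s$ a parameter of the second step only, which mirrors how definitions (1)-(3) are layered.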

The component evolution of the uniform model $G(n,m,p)$ was analyzed by Behrisch in [4], for the case when the scaling of nodes and attributes is $m = n^{\alpha}$, with $\alpha \neq 1$ and $p^2 m = c/n$. Theorem 1 in [4] states that the size of the largest component $\mathcal{N}(G(n,m,p))$ in a RIG satisfies
(i) $\mathcal{N}(G(n,m,p)) \le \frac{9}{(1-c)^2} \log n$, for $\alpha > 1$, $c < 1$,
(ii) $\mathcal{N}(G(n,m,p)) = (1 + o(1))(1 - \rho) n$, for $\alpha > 1$, $c > 1$,
(iii) $\mathcal{N}(G(n,m,p)) \le \frac{10\sqrt{c}}{(1-c)^2} \sqrt{\frac{m}{n}} \log m$, for $\alpha < 1$, $c < 1$,
(iv) $\mathcal{N}(G(n,m,p)) = (1 + o(1))(1 - \rho)\sqrt{cmn}$, for $\alpha < 1$, $c > 1$,
where $\rho$ is the solution in $(0,1)$ of the equation $\rho = \exp(c(\rho - 1))$.
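The fixed-point equation $\rho = \exp(c(\rho - 1))$ has no closed-form solution, but it is easy to solve numerically. The sketch below uses bisection on the interval $(0, 1/c)$, which brackets the nontrivial root because $c\,e^{1-c} < 1$ for $c > 1$; the function name `extinction_root` and the tolerance are assumptions made for illustration.

```python
import math


def extinction_root(c, tol=1e-12):
    """Solve rho = exp(c * (rho - 1)) for the root in (0, 1); requires c > 1."""
    if c <= 1:
        raise ValueError("a root in (0, 1) exists only for c > 1")

    def g(rho):
        return math.exp(c * (rho - 1.0)) - rho

    # g(0) = exp(-c) > 0 and g(1/c) = exp(1 - c) - 1/c < 0 for c > 1,
    # so the nontrivial root lies in (0, 1/c); rho = 1 is the trivial root.
    lo, hi = 0.0, 1.0 / c
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    rho = extinction_root(2.0)
    print(rho, 1.0 - rho)  # 1 - rho is the limiting fraction of nodes in the giant component
```

For $c = 2$ the giant component fraction $1 - \rho$ is roughly 0.797.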

The component evolution for the case $s \ge 1$ in the relation $|W(u) \cap W(v)| \ge s$ is considered in [6], where the following two RIG models are analyzed: (1) the $G_s(n,m,d)$ model, where $P[W(v) = A] = \binom{m}{d}^{-1}$ for all $A \subseteq W$ on $d$ elements, for a given $d$; (2) the $G'_s(n,m,p)$ model, where $P[W(v) = A] = p^{|A|}(1-p)^{m-|A|}$ for all $A \subseteq W$. In light of the results of [4], it has been shown in [6] that for $d = d(n)$, $p = p(n)$, $m = m(n)$, $n = o(m)$, where $s$ is a fixed integer, and $d^{2s} \sim c\, m^s s!/n$, the largest component in $G_s(n,m,d)$ satisfies: (i) $\mathcal{N}(G_s(n,m,d)) \le \frac{9}{(1-c)^2} \log n$, for $c < 1$, (ii) $\mathcal{N}(G_s(n,m,d)) = (1 + o(1))(1 - \rho) n$, for $c > 1$, in the case when $n \log n = o(m)$ for $s = 1$ and $n = o(m^{s/(2s-1)})$ for $s \ge 2$. The same results for the giant component in $G'_s(n,m,p)$ still hold for the case when $p^{2s} = c\, s!/(m^s n)$ and $n = o(m^{s/(2s-1)})$, see [6].

1 Note that the $p_w$ do not sum up to 1. Moreover, we can eliminate the cases $p_w = 0$ and $p_w = 1$. These two cases correspond, respectively, to none or all of the nodes $v$ being attached to the attribute $w$.

Both $G_s(n,m,d)$ and $G'_s(n,m,p)$ are special cases of a more general class studied in [13], where the number of attributes of each node is assigned randomly as in the bipartite configuration model. That is, for a given probability distribution $(P_0, P_1, \ldots, P_m)$, we have $P[|W(v)| = k] = P_k$ for all $0 \le k \le m$, and moreover, given the size $k$, all of the sets $W(v)$ are equally probable, that is, for any $A \subseteq W$ with $|A| = k$, $P[W(v) = A \mid |W(v)| = k] = \binom{m}{k}^{-1}$. Thus $G_s(n,m,d)$ is equivalent to the model of [13] with the delta-distribution, where the probability of the $d$-th coordinate is 1, while $G'_s(n,m,p)$ is equivalent to the model of [13] with the $\mathrm{Bin}(m,p)$ distribution. To complete the picture of previous work, in [5] it was shown that when $n = m$ a set of probabilities $\mathbf{p} = \{p_w\}_{w \in W}$ can be chosen to tune the degree and clustering coefficient of the graph.



3 Mathematical preliminaries 

In this paper, we analyze the component evolution of the general RIG structure. As we have already mentioned, the major challenge comes from the underlying structure of RIGs, which involves both the set of nodes and the set of attributes, as well as the set of different probabilities $\mathbf{p} = \{p_w\}_{w \in W}$.

Moreover, the edges in a RIG are not independent. Hence, a RIG cannot be treated as an Erdős-Rényi random graph $G_{n,\hat{p}}$ with the edge probability $\hat{p} = 1 - \prod_{w \in W}(1 - p_w^2)$. However, in [12], the authors provide a comparison between $G_{n,\hat{p}}$ and $G(n,m,p)$, showing that for $m = n^{\alpha}$ and $\alpha > 6$ these two classes of graphs have asymptotically the same properties. In [23], Rybarczyk has recently shown the equivalence of sharp threshold functions between $G_{n,\hat{p}}$ and $G(n,m,p)$ when $m \ge n^3$. In this work, we do not impose any constraints on the relation between $n$ and $m$, and we develop methods for the analysis of branching processes on RIGs, since the existing methods for the analysis of branching processes on $G_{n,p}$ do not apply.

We now briefly describe the edge dependence. Consider three distinct nodes $v_i, v_j, v_k$ from $V$. Conditionally on the set $W(v_k)$, by the definition (2), the sets $W(v_i) \cap W(v_k)$ and $W(v_j) \cap W(v_k)$ are mutually independent, which implies conditional independence of the events $\{v_i \sim v_k \mid W(v_k)\}$ and $\{v_j \sim v_k \mid W(v_k)\}$, that is,

$$P[v_i \sim v_k, v_j \sim v_k \mid W(v_k)] = P[v_i \sim v_k \mid W(v_k)]\, P[v_j \sim v_k \mid W(v_k)]. \tag{4}$$

However, the latter does not imply independence of the events $\{v_i \sim v_k\}$ and $\{v_j \sim v_k\}$, since in general

$$\begin{aligned} P[v_i \sim v_k, v_j \sim v_k] &= E\big[P[v_i \sim v_k, v_j \sim v_k \mid W(v_k)]\big] \\ &= E\big[P[v_i \sim v_k \mid W(v_k)]\, P[v_j \sim v_k \mid W(v_k)]\big] \\ &\neq P[v_i \sim v_k]\, P[v_j \sim v_k]. \end{aligned} \tag{5}$$

Furthermore, the conditional pairwise independence (4) does not extend to three or more nodes. Indeed, conditionally on the set $W(v_k)$, the sets $W(v_i) \cap W(v_j)$, $W(v_i) \cap W(v_k)$, and $W(v_j) \cap W(v_k)$ are not mutually independent, and hence neither are the events $\{v_i \sim v_j\}$, $\{v_i \sim v_k\}$, and $\{v_j \sim v_k\}$, that is,

$$P[v_i \sim v_j, v_i \sim v_k, v_j \sim v_k \mid W(v_k)] \neq P[v_i \sim v_j \mid W(v_k)]\, P[v_i \sim v_k \mid W(v_k)]\, P[v_j \sim v_k \mid W(v_k)]. \tag{6}$$
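The gap between conditional and unconditional independence can be checked exactly on a small attribute set. The sketch below, for $s = 1$, enumerates every possible value of $W(v_k)$ and compares the joint probability of the two edges with the product of the marginals; the function names are illustrative assumptions, and the enumeration is only feasible for a handful of attributes.

```python
from itertools import product
from math import prod


def edge_probability(p):
    """P[v_i ~ v_k] = 1 - prod_w (1 - p_w^2) when s = 1."""
    return 1.0 - prod(1.0 - pw * pw for pw in p)


def joint_edge_probability(p):
    """P[v_i ~ v_k, v_j ~ v_k], obtained by conditioning on W(v_k)."""
    total = 0.0
    for mask in product((0, 1), repeat=len(p)):
        # probability that W(v_k) is exactly the attribute set encoded by mask
        weight = prod(pw if bit else 1.0 - pw for pw, bit in zip(p, mask))
        # P[v_i ~ v_k | W(v_k)] = 1 - prod_{w in W(v_k)} (1 - p_w)
        hit = 1.0 - prod(1.0 - pw for pw, bit in zip(p, mask) if bit)
        # given W(v_k), the two edge events are independent, hence hit * hit
        total += weight * hit * hit
    return total


if __name__ == "__main__":
    p = [0.3, 0.5]
    print(joint_edge_probability(p), edge_probability(p) ** 2)
```

With `p = [0.3, 0.5]` the joint probability is 0.164375 while the product of the marginals is about 0.1008, so the two edges sharing $v_k$ are positively correlated in this example.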

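We now provide two identities, which we will use throughout this paper. For any $w \in W$, let $q_w := 1 - p_w$, and define $\prod_{\alpha \in \emptyset} q_\alpha = 1$.

Claim 1 For any node $u \in V$ and given set $A \subseteq W$,

$$P[W(u) \cap A = \emptyset \mid A] = \prod_{\alpha \in A}(1 - p_\alpha) = \prod_{\alpha \in A} q_\alpha. \tag{7}$$

Proof Write

$$P[W(u) \cap A = \emptyset \mid A] = P[\forall \alpha \in A,\ \alpha \notin W(u) \mid A] = \prod_{\alpha \in A} P[\alpha \notin W(u)] = \prod_{\alpha \in A}(1 - p_\alpha) = \prod_{\alpha \in A} q_\alpha,$$

which is the desired expression. ■

Claim 2 For any node $u \in V$, and given sets $A \subseteq B \subseteq W$,

$$P[W(u) \cap A = \emptyset,\ W(u) \cap (B \setminus A) \neq \emptyset \mid A, B] = \prod_{\alpha \in A} q_\alpha \Big(1 - \prod_{\alpha \in B \setminus A} q_\alpha\Big) = \prod_{\alpha \in A} q_\alpha - \prod_{\beta \in B} q_\beta.$$

Proof The sets $A$ and $B \setminus A$ are disjoint. The result follows from (7). ■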



4 Auxiliary process on general random intersection graphs 

Our analysis for the emergence of a giant component is inspired by the approach described in [2]. The difficulty in analyzing the evolution of the stochastic process defined by equations (1), (2), and (3) resides in the fact that we need, at least in principle, to keep track of the temporal evolution of the sets of nodes and attributes being explored. This results in a process that is not Markovian.

We construct an auxiliary process, which starts at an arbitrary node $v_0 \in V$, and reaches zero for the first time in a number of steps equal to the size of the component containing $v_0$. The process is algorithmically defined as follows.

Auxiliary Process. Let us denote by $V_t$ the cumulative set of nodes visited by time $t$, which we initialize to $V_0 = \{v_0\}$, and note that the neighbors of $v_0$ are the nodes $\{v \neq v_0 : W(v) \cap W(v_0) \neq \emptyset\}$. Starting with $Y_0 = 1$, the process evolves as follows: for $t = 1, 2, 3, \ldots, n-1$ and $Y_t > 0$, pick a node $v_t$ uniformly at random from the set $V \setminus V_{t-1}$ and update the set of visited nodes $V_t = V_{t-1} \cup \{v_t\}$. Denote by $W(v_t) = \{w \in W \mid A_{v_t,w} = 1\}$ the set of features associated to node $v_t$, and define

$$Y_t = \Big|\Big\{v \in V \setminus V_t \;:\; W(v) \cap \bigcup_{\tau=0}^{t} W(v_\tau) \neq \emptyset\Big\}\Big|.$$

The random variable $Y_t$ counts the number of nodes outside the set of visited nodes $V_t$ that are connected to $V_t$. Following [2], we call $Y_t$ the number of alive nodes at time $t$. We note that we do not need to keep track of the actual list of neighbors of $V_t$,

$$\Big\{v \in V \setminus V_t \;:\; W(v) \cap \bigcup_{\tau=0}^{t} W(v_\tau) \neq \emptyset\Big\}, \tag{8}$$

as in [2], because every node in $V \setminus V_t$ is equally likely to belong to the set (8). As a result, each time we need a random node from (8), we pick a node uniformly at random from $V \setminus V_t$.

To understand why this process is useful, notice that by time $t$, we know that the size of the component containing $v_0$ is at least as large as the number of visited nodes $|V_t|$ plus the number $Y_t$ of neighbors of $V_t$ not yet visited. Once the number $Y_t$ of neighbors connected to $V_t$ but not yet visited drops to zero, the size of $V_t$ is equal to the size of the component $C(v_0)$ containing $v_0$. We formalize this last statement by introducing the stopping time

$$T(v_0) = \inf\{t > 0 : Y_t = 0\}, \tag{9}$$

whose value is $|C(v_0)|$.
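The relation between the stopping time and the component size can be illustrated on a fixed realization of the attribute sets. The sketch below is one possible realization of the exploration, under the assumption (following the standard convention of [2] and the indexing of Lemma 3) that each step visits one currently alive node, so that $Y_t$ equals the number of discovered nodes minus $t$; the function name `exploration_stopping_time` is hypothetical.

```python
def exploration_stopping_time(attribute_sets, v0):
    """Explore the component of v0; return (T, [Y_0, Y_1, ..., Y_T])."""
    n = len(attribute_sets)
    discovered = {v0}        # nodes found so far, visited or alive
    alive = [v0]             # discovered but not yet visited, so Y_0 = 1
    seen_attributes = set()  # cumulative attribute set of the visited nodes
    trajectory = [1]
    t = 0
    while alive:
        v = alive.pop()      # visit one alive node
        t += 1
        seen_attributes |= attribute_sets[v]
        for u in range(n):
            # u becomes alive once it shares an attribute with a visited node
            if u not in discovered and attribute_sets[u] & seen_attributes:
                discovered.add(u)
                alive.append(u)
        trajectory.append(len(alive))  # Y_t = discovered - visited
    return t, trajectory


if __name__ == "__main__":
    W = [{0}, {0, 1}, {1}, {2}, {2}, set()]
    print(exploration_stopping_time(W, 0))  # component {0, 1, 2}
```

In this example the components are $\{0,1,2\}$, $\{3,4\}$ and $\{5\}$, and the returned stopping times are 3, 2 and 1 respectively.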

Finally, our analysis of that process requires us to keep track of the history of the feature sets uncovered by the process,

$$\mathcal{H}_t = \{W(v_0), W(v_1), \ldots, W(v_t)\}. \tag{10}$$



4.1 Process description in terms of random variable $Y_t$

As in [6], we denote the cumulative feature set associated to the sequence of nodes $v_0, \ldots, v_t$ from the auxiliary process by

$$W_{[t]} := \bigcup_{\tau=0}^{t} W(v_\tau). \tag{11}$$

We will characterize the process $\{Y_t\}_{t \ge 0}$ in terms of the number $Z_t$ of newly discovered neighbors of $V_t$. The latter is directly related to the increment of the process $Y_t$, defined by

$$Z_t = Y_t - Y_{t-1} + 1, \tag{12}$$

where the term $+1$ reflects the fact that $Y_{t-1}$ decreases by one when the node $v_t$ becomes a visited node at time $t$. The events that any given node, which is neither visited nor alive, becomes alive at time $t$ are conditionally independent given the history $\mathcal{H}_t$, since each event involves a different subset of the indicator random variables $\{A_{v,w}\}$. In light of Claim 2, the conditional probability that a node $u$ becomes alive at time $t$ is

$$\begin{aligned} r_t &:= P[u \sim v_t,\ u \nsim v_{t-1},\ u \nsim v_{t-2},\ \ldots,\ u \nsim v_0 \mid \mathcal{H}_t] \\ &= P[W(u) \cap W(v_t) \neq \emptyset,\ W(u) \cap W_{[t-1]} = \emptyset \mid \mathcal{H}_t] \\ &= P[W(u) \cap W(v_t) \neq \emptyset,\ W(u) \cap W_{[t-1]} = \emptyset \mid W(v_t), W_{[t-1]}] \\ &= \prod_{\alpha \in W_{[t-1]}} q_\alpha - \prod_{\alpha \in W_{[t]}} q_\alpha \\ &= \phi_{t-1} - \phi_t, \end{aligned} \tag{13}$$

where we set $\phi_t := \prod_{\alpha \in W_{[t]}} q_\alpha$, and use the convention $W_{[-1]} \equiv W(\emptyset) \equiv \emptyset$ and $\phi_{-1} = 1$. Observe that the probability (13) does not depend on $u$. Hence the number of new alive nodes at time $t$ is, conditionally on the history $\mathcal{H}_t$, a Binomial distributed random variable with parameters $r_t$ and

$$N_t = n - t - Y_t. \tag{14}$$

Formally,

$$Z_{t+1} \mid \mathcal{H}_t \sim \mathrm{Bin}(N_t, r_t). \tag{15}$$

This allows us to describe the distribution of Y t in the next lemma. 

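Lemma 3 For times $t \ge 1$, the number of alive nodes satisfies

$$Y_t \mid \mathcal{H}_{t-1} \sim \mathrm{Bin}\Big(n-1,\ 1 - \prod_{\tau=0}^{t-1}(1 - r_\tau)\Big) - t + 1. \tag{16}$$

The proof of this lemma requires us to establish the following result first.

Lemma 4 Let random variables $\Lambda_1, \Lambda_2$ satisfy: $\Lambda_1 \sim \mathrm{Bin}(m, \nu_1)$ and $\Lambda_2$ given $\Lambda_1$ is $\mathrm{Bin}(\Lambda_1, \nu_2)$. Then marginally $\Lambda_2 \sim \mathrm{Bin}(m, \nu_1 \nu_2)$ and $\Lambda_1 - \Lambda_2 \sim \mathrm{Bin}(m, \nu_1(1 - \nu_2))$.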

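Proof Let $U_1, \ldots, U_m$ and $V_1, \ldots, V_m$ be i.i.d. Uniform(0, 1) random variables. Writing

$$\Lambda_1 = \sum_{j=1}^{m} \mathbb{1}(U_j \le \nu_1) \quad \text{and} \quad \Lambda_2 \mid \Lambda_1 = \sum_{k : U_k \le \nu_1} \mathbb{1}(V_k \le \nu_2),$$

we have that

$$\Lambda_2 = \sum_{k=1}^{m} \mathbb{1}(U_k \le \nu_1 \cap V_k \le \nu_2) \overset{d}{=} \sum_{k=1}^{m} \mathbb{1}(U_k \le \nu_1 \nu_2),$$

from which the conclusion follows. ■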
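The thinning identity of Lemma 4 can be verified exactly for small parameters by summing over the intermediate variable. The sketch below compares the mixed probability mass functions with the claimed Binomial laws; the helper names and the parameter values are illustrative assumptions.

```python
from math import comb


def binom_pmf(k, m, q):
    """P[Bin(m, q) = k]."""
    return comb(m, k) * q**k * (1.0 - q) ** (m - k)


def thinned_pmf(k, m, nu1, nu2):
    """P[L2 = k] when L1 ~ Bin(m, nu1) and L2 | L1 ~ Bin(L1, nu2)."""
    return sum(binom_pmf(j, m, nu1) * binom_pmf(k, j, nu2) for j in range(k, m + 1))


def difference_pmf(d, m, nu1, nu2):
    """P[L1 - L2 = d] under the same two-stage construction."""
    return sum(binom_pmf(j, m, nu1) * binom_pmf(j - d, j, nu2) for j in range(d, m + 1))


if __name__ == "__main__":
    m, nu1, nu2 = 6, 0.4, 0.7
    for k in range(m + 1):
        print(k, thinned_pmf(k, m, nu1, nu2), binom_pmf(k, m, nu1 * nu2))
```

The two printed columns agree up to floating-point rounding, as do the values of `difference_pmf` and the pmf of $\mathrm{Bin}(m, \nu_1(1-\nu_2))$.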

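Proof (Proof of Lemma 3) We prove the assertion of the lemma by induction in $t$. For $t = 0$, $Y_0 = 1$, and for $t = 1$, $Y_1 = Z_1 \sim \mathrm{Bin}(n-1, r_0)$. Hence, the lemma is true for $t = 0$ and $t = 1$. Assume that the assertion is true for some $t \ge 1$,

$$Y_t \mid \mathcal{H}_{t-1} \sim \mathrm{Bin}\Big(n-1,\ 1 - \prod_{\tau=0}^{t-1}(1 - r_\tau)\Big) - t + 1. \tag{17}$$

From (15), we have $Z_{t+1} \mid \mathcal{H}_t \sim \mathrm{Bin}(N_t, r_t) = \mathrm{Bin}(n - t - Y_t, r_t)$. By (17), the number $n - t - Y_t$ of nodes not yet discovered is Binomial with $n-1$ trials and success probability $\prod_{\tau=0}^{t-1}(1 - r_\tau)$, and each such node remains undiscovered with probability $1 - r_t$. Now, from (12) and Lemma 4 it follows that

$$Y_{t+1} \mid \mathcal{H}_t \sim \mathrm{Bin}\Big(n-1,\ 1 - \prod_{\tau=0}^{t}(1 - r_\tau)\Big) - t. \tag{18}$$

Hence, by mathematical induction, the lemma holds for any $t \ge 0$. ■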



4.2 Expectation and variance of $\phi_t$

The history $\mathcal{H}_t$ embodies the evolution of how the features are discovered over time. It is insightful to recast that history in terms of the discovery times $\Gamma_w$ of each feature in $W$. Given any sequence of nodes $v_0, v_1, v_2, \ldots$, the probability that a given feature $w$ is first discovered at time $t < n$ is

$$P[\Gamma_w = t] = P[A_{v_t,w} = 1,\ A_{v_{t-1},w} = 0,\ \ldots,\ A_{v_0,w} = 0] = p_w (1 - p_w)^t.$$

If a feature $w$ is not discovered by time $n - 1$, we set $\Gamma_w = \infty$ and note that

$$P[\Gamma_w = \infty] = (1 - p_w)^n.$$

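From the independence of the random variables $A_{v,w}$, it follows that the discovery times $\{\Gamma_w : w \in W\}$ are independent. We now focus on describing the distribution of $\phi_t = \prod_{\alpha \in W_{[t]}} q_\alpha$. For $t \ge 0$, we have

$$\phi_t = \prod_{\alpha \in W_{[t]}} q_\alpha = \prod_{j=0}^{t} \prod_{\alpha \in W(v_j) \setminus W_{[j-1]}} q_\alpha = \prod_{j=0}^{t} \prod_{w \in W} q_w^{\mathbb{1}(\Gamma_w = j)} = \prod_{w \in W} q_w^{\mathbb{1}(\Gamma_w \le t)}. \tag{19}$$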

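Using the fact that for $B \sim \mathrm{Bernoulli}(r)$ the expectation $E[a^B] = 1 - (1 - a) r$, we can easily calculate the expectation of $\phi_t$:

$$E[\phi_t] = \prod_{w \in W} E\big[q_w^{\mathbb{1}(\Gamma_w \le t)}\big] = \prod_{w \in W} \big(1 - (1 - q_w) P[\Gamma_w \le t]\big) = \prod_{w \in W} \big(1 - (1 - q_w)(1 - q_w^{t+1})\big). \tag{20}$$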
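The product formula for the expectation of $\phi_t$ can be checked by brute force on a tiny instance. The sketch below assumes, as in the text, a fixed sequence of $t+1$ visited nodes whose attribute indicators are independent Bernoulli variables, and enumerates all $2^{(t+1)m}$ indicator configurations; the function names are illustrative.

```python
from itertools import product
from math import prod


def expected_phi_formula(p, t):
    """Closed form: prod_w (1 - (1 - q_w) * (1 - q_w^(t+1))), with q_w = 1 - p_w."""
    return prod(1.0 - pw * (1.0 - (1.0 - pw) ** (t + 1)) for pw in p)


def expected_phi_enumerated(p, t):
    """Average phi_t over all attribute indicators of the t + 1 visited nodes."""
    m = len(p)
    total = 0.0
    for bits in product((0, 1), repeat=(t + 1) * m):
        rows = [bits[j * m:(j + 1) * m] for j in range(t + 1)]
        # probability of this configuration of indicators A_{v_j, w}
        weight = prod(p[w] if row[w] else 1.0 - p[w] for row in rows for w in range(m))
        # attribute w is discovered by time t iff some visited node holds it
        phi = prod(1.0 - p[w] for w in range(m) if any(row[w] for row in rows))
        total += weight * phi
    return total


if __name__ == "__main__":
    p = [0.2, 0.5, 0.7]
    print(expected_phi_formula(p, 1), expected_phi_enumerated(p, 1))
```

For $t = 0$ the closed form reduces to $\prod_w (1 - p_w^2)$, which is the complement of the edge probability between two nodes.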

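The concentration of $\phi_0$ will be crucial for the analysis of the supercritical regime, Subsection 5.2. Hence, we here provide $E[\phi_0]$ and $E[\phi_0^2]$. From (20) it follows that

$$E[\phi_0] = \prod_{w \in W} (1 - p_w^2) = 1 - \sum_{w \in W} p_w^2 + o\Big(\sum_{w \in W} p_w^2\Big). \tag{21}$$

Moreover, from (19) it follows that

$$\begin{aligned} E[\phi_0^2] &= E\Big[\prod_{w \in W} q_w^{2\,\mathbb{1}(\Gamma_w = 0)}\Big] = \prod_{w \in W} \big(1 - (1 - q_w^2) P[\Gamma_w = 0]\big) = \prod_{w \in W} \big(1 - (1 - q_w^2) p_w\big) \\ &= \prod_{w \in W} \big(1 - 2p_w^2 + p_w^3\big) = 1 - 2\sum_{w \in W} p_w^2 + o\Big(\sum_{w \in W} p_w^2\Big). \end{aligned} \tag{22}$$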


5 Giant component 

With the process $\{Y_t\}_{t \ge 0}$ defined in the previous section, we analyze both the subcritical and supercritical regimes of our random intersection graph by adapting the percolation-based techniques used to analyze Erdős-Rényi random graphs [2]. The technical difficulty in analyzing the stopping time rests in the fact that the distribution of $Y_t$ depends on the history of the process, dictated by the structure of the general RIG. In the next two subsections, we give conditions for the non-existence and, respectively, for the existence and uniqueness of the giant component in general RIGs.



5.1 Subcritical regime

Theorem 5 Let

$$\sum_{w \in W} p_w^3 = O(1/n^2) \quad \text{and} \quad p_w = O(1/n) \ \text{for all } w.$$

For any positive constant $c < 1$, if $\sum_{w \in W} p_w^2 = c/n$, then all components in a general random intersection graph $G(n,m,\mathbf{p})$ are of order $O(\log n)$, with high probability.^2

Proof We generalize the techniques used in the proof for the subcritical case in $G_{n,p}$ presented in [2]. Let $T(v_0)$ be the stopping time defined in (9) for the process starting at node $v_0$, and note that $T(v_0) = |C(v_0)|$. We will bound the size of the largest component, and prove that under the conditions of the theorem, all components are of order $O(\log n)$, whp.



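For all $t \ge 0$,

$$\begin{aligned} P[T(v_0) > t] &= E\big[P[T(v_0) > t \mid \mathcal{H}_t]\big] \le E\big[P[Y_t > 0 \mid \mathcal{H}_t]\big] \\ &= E\Big[P\Big[\mathrm{Bin}\Big(n-1,\ 1 - \prod_{\tau=0}^{t-1}(1 - r_\tau)\Big) \ge t \,\Big|\, \mathcal{H}_t\Big]\Big]. \end{aligned} \tag{23}$$

Bounding from above, which can easily be proven by induction in $t$ for $r_\tau \in [0, 1]$, we have

$$1 - \prod_{\tau=0}^{t-1}(1 - r_\tau) \le \sum_{\tau=0}^{t-1} r_\tau = \sum_{\tau=0}^{t-1}(\phi_{\tau-1} - \phi_\tau) = 1 - \phi_{t-1}. \tag{24}$$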
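The telescoping sum and the union-type bound above are easy to check numerically for a given history. The sketch below computes the rates $r_\tau = \phi_{\tau-1} - \phi_\tau$ from a list of attribute sets of visited nodes; the function name `rates_from_history` and the example history are assumptions made for illustration.

```python
from math import prod


def rates_from_history(history, p):
    """Return [r_0, ..., r_t] with r_tau = phi_{tau-1} - phi_tau and phi_{-1} = 1."""
    seen, phi_prev, rates = set(), 1.0, []
    for attrs in history:  # history = [W(v_0), W(v_1), ..., W(v_t)]
        seen |= attrs
        phi = prod(1.0 - p[w] for w in seen)  # phi_tau over the cumulative attribute set
        rates.append(phi_prev - phi)
        phi_prev = phi
    return rates


if __name__ == "__main__":
    p = [0.1, 0.2, 0.3]
    history = [{0}, {0, 2}, {1}]
    r = rates_from_history(history, p)
    lhs = 1.0 - prod(1.0 - x for x in r)  # 1 - prod (1 - r_tau)
    rhs = sum(r)                          # telescopes to 1 - phi of the last step
    print(r, lhs, rhs)
```

Here the rates are approximately 0.1, 0.27 and 0.126; their sum 0.496 equals one minus the final value of $\phi$, and it dominates $1 - \prod_\tau (1 - r_\tau) \approx 0.426$.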

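By using stochastic ordering of the Binomial distribution, both in $n$ and in $\sum_{\tau=0}^{t-1} r_\tau$, and for any positive constant $\nu$, which is to be specified later, it follows that

$$\begin{aligned} P[T(v_0) > t \mid \mathcal{H}_t] &\le P\Big[\mathrm{Bin}\Big(n, \sum_{\tau=0}^{t-1} r_\tau\Big) \ge t \,\Big|\, \mathcal{H}_t\Big] = P[\mathrm{Bin}(n, 1 - \phi_{t-1}) \ge t \mid \mathcal{H}_t] \\ &= P[\mathrm{Bin}(n, 1 - \phi_{t-1}) \ge t \mid 1 - \phi_{t-1} < (1-\nu)t/n \cap \mathcal{H}_t]\; P[1 - \phi_{t-1} < (1-\nu)t/n \mid \mathcal{H}_t] \\ &\quad + P[\mathrm{Bin}(n, 1 - \phi_{t-1}) \ge t \mid 1 - \phi_{t-1} \ge (1-\nu)t/n \cap \mathcal{H}_t]\; P[1 - \phi_{t-1} \ge (1-\nu)t/n \mid \mathcal{H}_t] \\ &\le P[\mathrm{Bin}(n, 1 - \phi_{t-1}) \ge t \mid 1 - \phi_{t-1} < (1-\nu)t/n \cap \mathcal{H}_t] \\ &\quad + P[1 - \phi_{t-1} \ge (1-\nu)t/n \mid \mathcal{H}_t]. \end{aligned} \tag{25}$$

Furthermore, using the fact that the event $\{1 - \phi_{t-1} < (1-\nu)t/n\}$ is $\mathcal{H}_t$-measurable, together with the stochastic ordering of the Binomial distribution, we obtain

$$P[\mathrm{Bin}(n, 1 - \phi_{t-1}) \ge t \mid 1 - \phi_{t-1} < (1-\nu)t/n \cap \mathcal{H}_t] \le P[\mathrm{Bin}(n, (1-\nu)t/n) \ge t \mid \mathcal{H}_t].$$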



2 We use the phrase "with high probability", abbreviated whp, to mean with probability $1 - o(1)$ as the number of nodes $n \to \infty$.



Taking the expectation with respect to the history $\mathcal{H}_t$ in (25) yields

$$P[T(v_0) > t] \le P[\mathrm{Bin}(n, (1-\nu)t/n) \ge t] + P[1 - \phi_{t-1} \ge (1-\nu)t/n].$$

For $t = K_0 \log n$, where $K_0$ is a constant large enough and independent of the initial node $v_0$, the Chernoff bound ensures that $P[\mathrm{Bin}(n, (1-\nu)t/n) \ge t] = o(1/n)$. To bound $P[1 - \phi_{t-1} \ge (1-\nu)t/n]$, use (19) to obtain

$$\Big\{1 - \phi_{t-1} \ge \frac{(1-\nu)t}{n}\Big\} = \Big\{\sum_{w \in W} \log\Big(\frac{1}{1 - p_w}\Big) \mathbb{1}(\Gamma_w < t) \ge -\log\Big(1 - \frac{(1-\nu)t}{n}\Big)\Big\}.$$

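Linearize $-\log(1 - (1-\nu)t/n) = (1-\nu)t/n + o(t/n)$ and define the bounded auxiliary random variables $X_{t,w} = n \log(1/(1 - p_w))\, \mathbb{1}(\Gamma_w < t)$. Direct calculations reveal that

$$\begin{aligned} E[X_{t,w}] &= n \log\Big(\frac{1}{1 - p_w}\Big)(1 - q_w^t) = n\big(p_w + o(p_w)\big)\big(1 - (1 - p_w)^t\big) \\ &= n\big(p_w + o(p_w)\big)\big(t p_w + o(t p_w)\big) = n t p_w^2 + o(n t p_w^2), \end{aligned} \tag{26}$$

which implies

$$\sum_{w \in W} E[X_{t,w}] = n t \sum_{w \in W} p_w^2 + o\Big(n t \sum_{w \in W} p_w^2\Big). \tag{27}$$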

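Thus under the stated condition that

$$n \sum_{w \in W} p_w^2 = c < 1,$$

it follows that $\sum_{w \in W} E[X_{t,w}] = ct(1 + o(1))$, so that $t - \sum_{w \in W} E[X_{t,w}] = (1 - c)t(1 + o(1)) > 0$. In light of Bernstein's inequality [5], we bound

$$\begin{aligned} P[1 - \phi_{t-1} \ge (1-\nu)t/n] &= P\Big[\sum_{w \in W} X_{t,w} \ge (1-\nu)t\Big] \le P\Big[\sum_{w \in W} \big(X_{t,w} - E[X_{t,w}]\big) \ge (1 - \nu - c)t\Big] \\ &\le \exp\Bigg(-\frac{\frac{3}{2}\big((1 - \nu - c)t\big)^2}{3\sum_{w \in W} \mathrm{Var}[X_{t,w}] + n t \max_w\{p_w\}(1 + o(1))}\Bigg), \end{aligned} \tag{28}$$

where lower-order $o(t)$ terms coming from the linearization are suppressed in the thresholds.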



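Since

$$\begin{aligned} E[X_{t,w}^2] &= n^2 \log^2\Big(\frac{1}{1 - p_w}\Big)(1 - q_w^t) = n^2\big(p_w + o(p_w)\big)^2\big(1 - (1 - p_w)^t\big) \\ &= n^2\big(p_w^2 + o(p_w^2)\big)\big(t p_w + o(t p_w)\big) = n^2 t p_w^3 + o(n^2 t p_w^3), \end{aligned} \tag{29}$$

it follows that for some large constant $K_1 > 0$,

$$\sum_{w \in W} \mathrm{Var}[X_{t,w}] \le \sum_{w \in W} E[X_{t,w}^2] = n^2 t \sum_{w \in W} p_w^3 + o\Big(n^2 t \sum_{w \in W} p_w^3\Big) \le K_1 t.$$

Finally, the assumption of the theorem implies that there exists a constant $K_2 > 0$ such that

$$n \max_{w \in W} p_w \le K_2.$$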



Substituting these bounds into (28) yields

$$P[1 - \phi_{t-1} \ge (1-\nu)t/n] \le \exp\Big(-\frac{3(1 - \nu - c)^2}{2(3K_1 + K_2)}\, t\, (1 + o(1))\Big),$$

and taking $\nu \in (0, 1 - c)$ and $t = K_3 \log n$ for some constant $K_3$ large enough and not depending on the initial node $v_0$, we conclude that $P[1 - \phi_{t-1} \ge (1-\nu)t/n] = o(n^{-1})$, which in turn implies that taking the constant $K_4 = \max\{K_0, K_3\}$ ensures that

$$P[T(v_0) > K_4 \log n] = o(1/n)$$

for any initial node $v_0$. Finally, a union bound over the $n$ possible starting values $v_0$ implies that

$$P\Big[\max_{v_0 \in V} T(v_0) > K_4 \log n\Big] \le n\, o(n^{-1}) = o(1),$$

which implies that all connected components in the random intersection graph are of size $O(\log n)$, whp. ■

Remarks. We now consider the conditions of the theorem. From the Cauchy-Schwarz inequality, we obtain $\big(\sum_{w \in W} p_w^3\big)\big(\sum_{w \in W} p_w\big) \ge \big(\sum_{w \in W} p_w^2\big)^2$. Moreover, given that $\sum_{w \in W} p_w^3 = O(1/n^2)$ and $p_w = O(1/n)$, it follows that $\sum_{w \in W} p_w^2 = O(\sqrt{m/n^3})$. Hence, for $\sum_{w \in W} p_w^2 = c/n$, when $c < 1$, it follows that $m = \Omega(n)$, which is consistent with the results in [4] on the non-existence of a giant component in a uniform RIG.



5.2 Supercritical regime

We now turn to the study of the supercritical regime in which $\lim_{n \to \infty} n \sum_{w \in W} p_w^2 = c > 1$.

Theorem 6 Let

$$\sum_{w \in W} p_w^3 = o\Big(\frac{\log n}{n^2}\Big) \quad \text{and} \quad p_w = o\Big(\frac{\log n}{n}\Big) \ \text{for all } w.$$

For any constant $c > 1$, if $\sum_{w \in W} p_w^2 = c/n$, then whp there exists a unique largest component in $G(n,m,\mathbf{p})$, of order $\Theta(n)$. Moreover, the size of the giant component is given by $n\zeta_c(1 + o(1))$, where $\zeta_c$ is the solution in $(0,1)$ of the equation $1 - e^{-c\zeta} = \zeta$, while all other components are of size $O(\log n)$.

Remarks. The conditions on $p_w$ and $\sum_w p_w^3$ are weaker than the ones in the case of the subcritical regime.

The proof proceeds as follows. The first step is to bound, both from above and below, the value $1 - \prod_{\tau=0}^{t-1}(1 - r_\tau)$ that governs the behavior of the branching process $\{Y_t\}_{t \ge 0}$, see Lemma 3. With the lower bound, we show the emergence with high probability of at least one giant component of size $\Theta(n)$. We use the upper bound to prove uniqueness of the giant component. Technically, we make use of these bounds to compare our branching process to branching processes arising in the study of Erdős-Rényi random graphs.

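Proof We start by bounding $1 - \prod_{\tau=0}^{t-1}(1 - r_\tau)$. The upper bound $\sum_{\tau=0}^{t-1} r_\tau$ has been previously established in (24). For the lower bound, we apply Jensen's inequality to the function $\log(1 - x)$ to get

$$\begin{aligned} \log \prod_{\tau=0}^{t-1}(1 - r_\tau) &= \sum_{\tau=0}^{t-1} \log(1 - r_\tau) = \sum_{\tau=0}^{t-1} \log\big(1 - (\phi_{\tau-1} - \phi_\tau)\big) \\ &\le t \log\Big(1 - \frac{1}{t}\sum_{\tau=0}^{t-1}(\phi_{\tau-1} - \phi_\tau)\Big) = t \log\Big(1 - \frac{1 - \phi_{t-1}}{t}\Big). \end{aligned} \tag{30}$$

In light of (19), $\phi_t$ is decreasing in $t$, and hence

$$1 - \Big(1 - \frac{1 - \phi_{t-1}}{t}\Big)^t \le 1 - \prod_{\tau=0}^{t-1}(1 - r_\tau) \le \sum_{\tau=0}^{t-1} r_\tau = 1 - \phi_{t-1}. \tag{31}$$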

To further bound $1 - \big(1 - \frac{1 - \phi_{t-1}}{t}\big)^t$, consider the function $f_t(x) = 1 - (1 - x/t)^t$ for $x$ in a neighborhood of the origin and $t \ge 1$. For any fixed $x$, $f_t(x)$ decreases to $1 - e^{-x}$ as $t$ tends to infinity. The latter function is concave, and hence for all $x \le \varepsilon$,

$$\frac{1 - e^{-\varepsilon}}{\varepsilon}\, x \le f_t(x).$$

Note that $(1 - e^{-\varepsilon})/\varepsilon$ can be made arbitrarily close to one by taking $\varepsilon$ small enough. Furthermore, $f_t(x)$ is increasing in $x$ for fixed $t$. From (19), $1 - \phi_0 \le 1 - \phi_t$, hence $1 - \big(1 - \frac{1 - \phi_0}{t}\big)^t \le 1 - \big(1 - \frac{1 - \phi_{t-1}}{t}\big)^t$. Looking closer at $1 - \phi_0$, from (22) and (21), by using the Chebyshev inequality, with $\sum_{w \in W} p_w^2 = c/n$, it follows that $1 - \phi_0$ is concentrated around its mean $E[1 - \phi_0] = (c/n)(1 + o(1))$. That is, for any constant $\delta > 0$, $1 - \phi_0 \in ((1 - \delta)c/n, (1 + \delta)c/n)$, with probability $1 - o(1/n)$. We conclude that for any sufficiently small $\delta > 0$ there is $\varepsilon > 0$ such that $(c - \delta)\frac{1 - e^{-\varepsilon}}{\varepsilon} > 1$, since the constant $c > 1$. Moreover, since $\lim_{\varepsilon \to 0} \frac{1 - e^{-\varepsilon}}{\varepsilon} = 1$, by choosing $\varepsilon$ sufficiently small, $\frac{1 - e^{-\varepsilon}}{\varepsilon}$ can be arbitrarily close to 1. It follows that $1 - \prod_{\tau=0}^{t-1}(1 - r_\tau) \ge c'/n$, for some constant $c > c' > 1$ arbitrarily close to $c$. Hence, the branching process on the RIG is stochastically lower bounded by the $\mathrm{Bin}(n - 1, c'/n)$, which stochastically dominates a branching process on $G_{n,c'/n}$. Because $c' > 1$, there exists whp a giant component of size $\Theta(n)$ in $G_{n,c'/n}$. This implies that the stopping time of the branching process associated to $G_{n,c'/n}$ is $\Theta(n)$ with high probability, and so is the stopping time $T(v)$ for some $v \in V$, which implies that there is a giant component in a general RIG, whp.

Let us look closer at the size of that giant component. From the representation (19) for $\phi_{t-1}$, consider the previously introduced random variables $X_{t,w} = n \log(1/(1 - p_w))\, \mathbb{1}(\Gamma_w < t)$. Similarly as in the proof of Theorem 5, it follows that under the conditions of the theorem there is a positive constant $\delta > 0$ such that $\sum_w X_{t,w}$ is concentrated within $(1 \pm \delta)\sum_w E[X_{t,w}] = (1 \pm \delta)ct(1 + o(1))$, with probability $1 - o(1)$. Hence, there exists $p^+ = c^+/n$, for some constant $c^+ > c > 1$, such that $1 - \phi_{t-1} \le 1 - (1 - p^+)^t$, which is equivalent to $-\log \phi_{t-1} \le -t \log(1 - p^+) = t p^+ + o(t p^+) = t c^+/n + o(t/n)$. Similarly, the concentration of $\phi_{t-1}$ implies that there exists $p^- = c^-/n$, with $c > c^- > 1$, such that $1 - (1 - p^-)^t \le 1 - \big(1 - (1 - \phi_{t-1})/t\big)^t$, which implies that $-\log \phi_{t-1} \ge -t \log(1 - p^-) = t p^- + o(t p^-) = t c^-/n + o(t/n)$. Combining the upper and lower bound, we conclude that with probability $1 - o(1)$, the rate of the branching process on the RIG is bracketed by

$$1 - (1 - p^-)^t \le 1 - \prod_{\tau=0}^{t-1}(1 - r_\tau) \le 1 - (1 - p^+)^t. \tag{32}$$

The stochastic dominance of the Binomial distribution together with (32) implies

$$P\big[\mathrm{Bin}\big(n-1,\ 1 - (1 - p^-)^t\big) \ge t\big] \le P\Big[\mathrm{Bin}\Big(n-1,\ 1 - \prod_{\tau=0}^{t-1}(1 - r_\tau)\Big) \ge t\Big] \le P\big[\mathrm{Bin}\big(n-1,\ 1 - (1 - p^+)^t\big) \ge t\big]. \tag{33}$$


In light of (32), the branching process $\{Y_t\}_{t \ge 0}$ associated to a RIG is stochastically bounded from below and from above by the branching processes associated to $G_{n,p^-}$ and $G_{n,p^+}$, respectively (for the analysis on an Erdős-Rényi graph, see [2]). Since both $c^-, c^+ > 1$, there exist giant components in both $G_{n,p^-}$ and $G_{n,p^+}$, whp.



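In [2], it has been shown that the giant component in $G_{n,\lambda/n}$, for $\lambda > 1$, is unique and of size $\approx n\zeta_\lambda$, where $\zeta_\lambda$ is the unique solution in $(0,1)$ of the equation

$$1 - e^{-\lambda\zeta} = \zeta. \tag{34}$$

Moreover, the size of the giant component in $G_{n,\lambda/n}$ satisfies the central limit theorem

$$\frac{\max_v |C(v)| - \zeta_\lambda n}{\sqrt{n}} \xrightarrow{d} \mathcal{N}\Big(0,\ \frac{\zeta_\lambda(1 - \zeta_\lambda)}{(1 - \lambda + \lambda\zeta_\lambda)^2}\Big). \tag{35}$$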



From the definition of the stopping time, see (23), and from (33) and (35), it follows that there is a giant component in a RIG, of size at least $n\zeta_\lambda(1 - o(1))$, whp. Furthermore, the stopping times of the branching processes associated to $G_{n,p^-}$ and $G_{n,p^+}$ are approximately $\zeta n$, where $\zeta$ satisfies (34), with $\lambda^- = np^-$ and $\lambda^+ = np^+$, respectively. These two stopping times are close to one another, which follows from analyzing the function $F(\zeta, c) = 1 - \zeta - e^{-c\zeta}$, where $(\zeta, c)$ is the solution of $F(\zeta, c) = 0$, for given $c$. Since all partial derivatives of $F(\zeta, c)$ are continuous and bounded, the stopping times of the branching processes defined from $G_{n,p^-}$ and $G_{n,p^+}$ are 'close' to the solution of (34) for $\lambda = c$. From (33), the stopping time of a RIG is bounded by the stopping times on $G_{n,p^-}$ and $G_{n,p^+}$.

We conclude by proving that whp, the giant component of a RIG is unique by adapting the arguments in [2] to our setting. Let us assume that there are at least two giant components in a RIG, with the sets of nodes $V_1, V_2 \subset V$. Let us create a new, independent 'sprinkling' $\widehat{\mathrm{RIG}}$ on top of our RIG, with the same sets of nodes and attributes, while $\hat{p}_w = p_w^{\gamma}$, for $\gamma > 1$ to be defined later. Now, our object of interest is $\mathrm{RIG}_{new} = \mathrm{RIG} \cup \widehat{\mathrm{RIG}}$. Let us consider all $\Theta(n^2)$ pairs $\{v_1, v_2\}$, where $v_1 \in V_1$, $v_2 \in V_2$, which are independent in $\widehat{\mathrm{RIG}}$ (but not in RIG), hence the probability that two nodes $v_1, v_2 \in V$ are connected in $\widehat{\mathrm{RIG}}$ is given by

$$1 - \prod_w (1 - \hat{p}_w^2) = 1 - \prod_w (1 - p_w^{2\gamma}) = \sum_w p_w^{2\gamma} + o\Big(\sum_w p_w^{2\gamma}\Big), \tag{36}$$

which is true, since $\gamma > 1$ and $p_w = O(1/n)$ for any $w$. Given that $\sum_w p_w^2 = c/n$, we choose $\gamma > 1$ so that $\sum_w p_w^{2\gamma} = \omega(1/n^2)$. Now, by the Markov inequality, whp there is a pair $\{v_1, v_2\}$ such that $v_1$ is connected to $v_2$ in $\widehat{\mathrm{RIG}}$, implying that $V_1, V_2$ are connected, whp, forming one connected component within $\mathrm{RIG}_{new}$. From the previous analysis, it follows that this component is of size at least $2n\zeta_\lambda(1 - \delta)$ for any small constant $\delta > 0$. On the other hand, the probabilities $p_w^{new}$ in $\mathrm{RIG}_{new}$ satisfy

$$p_w^{new} = 1 - (1 - p_w)(1 - \hat{p}_w) = p_w + \hat{p}_w(1 - p_w) = p_w + p_w^{\gamma}(1 - p_w) = p_w(1 + o(1)),$$

which is again true, since $\gamma > 1$ and $p_w = O(1/n)$ for any $w$. Thus,

$$\sum_{w \in W} (p_w^{new})^2 = \sum_{w \in W} p_w^2 + \Theta\Big(\sum_{w \in W} p_w^{1+\gamma}(1 - p_w)\Big) = \sum_{w \in W} p_w^2 (1 + o(1)) = c/n + o(1/n). \tag{37}$$

Given that the stopping time on a RIG is bounded by the stopping times on $G_{n,p^-}$ and $G_{n,p^+}$, and from its continuity, it follows that the giant component in $\mathrm{RIG}_{new}$ cannot be of size $2n\zeta_\lambda(1 - \delta)$, which is a contradiction. Thus, there is only one giant component in a RIG, of size given by $n\zeta_c(1 + o(1))$, where $\zeta_c$ satisfies (34) for $\lambda = c$. Moreover, knowing the behavior of $G_{n,p}$, from (33) it follows that all other components are of size $O(\log n)$. ■



6 Conclusion 

The analysis of random models for bipartite graphs is important for the study of social networks, or any network 
formed by associating nodes with shared attributes. In the random intersection graph (RIG) model, nodes have 
certain attributes with fixed probabilities. In this paper, we have considered the general RIG model, where these probabilities are represented by a set of probabilities $\mathbf{p} = \{p_w\}_{w \in W}$, where $p_w$ denotes the probability that a node is attached to the attribute $w$.

We have analyzed the evolution of components in general RIGs, giving conditions for existence and uniqueness of 
the giant component. We have done so by generalizing the branching process argument used to study the birth of 
the giant component in Erdos-Renyi graphs. We have considered a dependent, inhomogeneous Galton-Watson 
process, where the number of offspring follows a binomial distribution with a different number of nodes and 
different rate at each step during the evolution. The analysis of such a process is complicated by the dependence 
on its history, dictated by the structure of general RIGs. We have shown that in spite of this difficulty, it is possible to give stochastic bounds on the branching process, and that under certain conditions the giant component appears at the threshold $n \sum_{w \in W} p_w^2 = 1$, with probability tending to one, as the number of nodes tends to infinity.